feat[vortex-cuda]: GPU FSST decompression kernel #7776
Conversation
Merging this PR will degrade performance by 10.6%

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] | 246.1 ns | 275.3 ns | -10.6% |
Comparing asubiotto/fsst-cuda (bebed3a) with develop (903ee6c)
Force-pushed from c982cd8 to 007bdab
Force-pushed from a10b28e to 79d1f10
Very cool!
Force-pushed from 79d1f10 to 46670e8
Thanks for the review! Addressed the comments.
0ax1 left a comment:
Fantastic contribution, one question inline.
```rust
// Prefix-sum lens to per-string u32 output offsets so the kernel
// knows where to write each decoded string.
#[expect(clippy::cast_possible_truncation)]
```
This looks unsound to me
Yes, this is leftover from the hackathon. Ignore. I will template on output offsets.
Or just use u64, but that'd be a little wasteful.
I chose to just use u64, not sure templating is worth it. LMK and I can change the approach.
```rust
let output_offsets: Vec<u32> = match_each_integer_ptype!(lens.ptype(), |P| {
    let mut out = Vec::with_capacity(lens.len() + 1);
    let mut acc: usize = 0;
    out.push(0u32);
    for &l in lens.as_slice::<P>() {
        acc += l as usize;
        out.push(acc as u32);
    }
    out
});
```
Would be nice if this were computed in the kernel too?
How would you suggest doing this? The kernel needs to know the output offsets in order to decode, and I think the only thing that would work would be an O(n²) prefix sum of all lengths for every string. It's probably cheaper to execute this linear computation, with its serial dependencies, on the CPU. A GSST encoding would obviate the need for this, so I would just punt on it.
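For concreteness, a minimal sketch of the truncation-free u64 variant that was settled on, mirroring the snippet quoted above (the landed code may differ in detail):

```rust
// Hedged sketch: the same prefix sum as above, but accumulating into u64 so
// no cast can truncate; the trade-off is 2x the offset memory.
let output_offsets: Vec<u64> = match_each_integer_ptype!(lens.ptype(), |P| {
    let mut out = Vec::with_capacity(lens.len() + 1);
    let mut acc: u64 = 0;
    out.push(0u64);
    for &l in lens.as_slice::<P>() {
        acc += l as u64;
        out.push(acc);
    }
    out
});
```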
Force-pushed from 46670e8 to bebed3a
Updated to use u64 output offsets and cleaned up the kernel parameters into an args struct.
Gave the benchmark a spin also on my end, on a GH200 (which has 4 TB/s memory bandwidth): [benchmark output screenshot]

So the main bottleneck for this kernel is warp divergence: [profiler screenshot]

For comparison, this is e.g. what we get for runend: [profiler screenshot]
```rust
use crate::timed_launch_strategy::TimedLaunchStrategy;

const BENCH_SIZES: &[(usize, &str)] = &[(1_000_000, "1M"), (5_000_000, "5M"), (10_000_000, "10M")];
```
At this point we run all benchmarks for vortex-cuda with 100M elements and reuse the constant from vortex-cuda/benches/bench_config/mod.rs. 10M inputs proved too noisy when running the benchmarks in codspeed and can't saturate larger GPUs like a GH200.
Yeah the difference is that with FSST each input is a clickbench URL. Happy to use 100M if you want, but I think that's around 10GB.
Yeah fair, we can keep 10M, but let's add a comment on the constant explaining why it's different here.
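Something along these lines on the constant, say (wording and placement are just a suggestion):

```rust
// Unlike the other vortex-cuda benches, which reuse the shared 100M-element
// constant from benches/bench_config/mod.rs, each FSST input here is a
// clickbench URL, so 100M inputs would be roughly 10 GB of data. Smaller
// sizes keep the benchmark practical.
const BENCH_SIZES: &[(usize, &str)] = &[(1_000_000, "1M"), (5_000_000, "5M"), (10_000_000, "10M")];
```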
Oh, we also need to wire the bench into codspeed in .github/workflows/codspeed.yml. We can put it next to zstd, which is nice for comparison: `- { shard: 7, name: "Encodings 4", packages: "vortex-sparse vortex-zigzag vortex-zstd" }`.
Nice, wish I had one of those lying around 😂 Yes, that's the main idea behind the GSST splits. Precomputing splits on the CPU showed +30% kernel throughput at 1M clickbench URLs, +26% at 5M, and +12% at 10M on the DGX Spark. While we can precompute splits, I think it's maybe a little too much CPU prep, and it's rather a motivation for the GSST encoding? Happy to introduce splits in a follow-up PR if we want to do that.
Heh, I mean this is clearly not a blocker, but I wanted to pin down the exact numbers. We can definitely land as-is from a perf standpoint.
```rust
let host_bytes = CudaDeviceBuffer::new(device_output)
    .copy_to_host(Alignment::new(1))?
    .await?;
let host_bytes = host_bytes.slice(0..total_size);
```
```rust
let (buffers, views) = match_each_integer_ptype!(lens.ptype(), |P| {
    build_views(
        0,
        MAX_BUFFER_LEN,
        host_bytes.into_mut(),
        lens.as_slice::<P>(),
    )
});
```
Can we build the views on the GPU?
Probably, since we already compute the output offsets. It's just a question of modding by `i32::MAX` to split up the output into a couple of buffers.
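A minimal sketch of that splitting, assuming output buffers capped at `i32::MAX` bytes and glossing over strings that straddle a boundary (`split_offset` is a made-up name):

```rust
// Hedged sketch: map a global u64 output offset to a (buffer index, local
// offset) pair, assuming each output buffer holds at most i32::MAX bytes.
// A real implementation also has to keep any single string within one
// buffer, which shifts the split points.
fn split_offset(global_offset: u64) -> (u32, u32) {
    const MAX_BUFFER_LEN: u64 = i32::MAX as u64;
    (
        (global_offset / MAX_BUFFER_LEN) as u32,
        (global_offset % MAX_BUFFER_LEN) as u32,
    )
}
```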
Summary
This commit implements on-GPU decompression of the existing FSST encoding. The kernel achieves ~42% of max throughput utilization as measured against the `throughput_cuda` benchmark on a DGX Spark. CPU work is still required to compute the output offsets.

The core performance win is buffering up to 24 bytes of decompressed data in three u64 registers and emitting the widest aligned stores possible, up to u128 (`st.global.v2.u64`).
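To make the buffering idea concrete, here is a small CPU-side model of the store strategy; this is illustrative Rust, not the actual CUDA kernel, which keeps the buffer in three u64 registers and flushes via `st.global.v2.u64`:

```rust
// CPU model of the kernel's buffered-store strategy: decoded symbols (up to
// 8 bytes each, held as a u64 plus a length) are packed into a 24-byte
// staging buffer, and once 16+ bytes are pending they are flushed as one
// 16-byte store (the u128 / st.global.v2.u64 case on the GPU).
struct StagedOutput<'a> {
    out: &'a mut Vec<u8>,
    buf: [u8; 24],
    pending: usize,
}

impl StagedOutput<'_> {
    fn push_symbol(&mut self, sym: u64, len: usize) {
        debug_assert!(len <= 8);
        // Always write all 8 symbol bytes; bytes past `len` are scratch and
        // get overwritten by the next symbol, mirroring the kernel's trick
        // of avoiding byte-granular stores.
        self.buf[self.pending..self.pending + 8].copy_from_slice(&sym.to_le_bytes());
        self.pending += len;
        if self.pending >= 16 {
            // Emit the widest store available (16 bytes here), then shift
            // the remaining tail to the front of the staging buffer.
            self.out.extend_from_slice(&self.buf[..16]);
            self.buf.copy_within(16..24, 0);
            self.pending -= 16;
        }
    }

    fn finish(self) {
        // Drain whatever is left with a final narrow store.
        self.out.extend_from_slice(&self.buf[..self.pending]);
    }
}
```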
The 256-entry symbol table (≤ 2 KB) is read directly from global memory. Staging it into shared memory measured ~3% slower at 10M rows and ~15% slower at 1M rows. The hypothesis is that L1 already holds the table after a few iterations and the explicit shared copy adds bank-conflict latency on the warp-divergent `symbols[code]` reads; the gap is wider at 1M because the kernel is less bandwidth-bound there.
Further optimizations would require an encoding change. Splits-style intra-string parallelism (one GPU thread per ~32-compressed-byte chunk instead of per-string) was prototyped on top of this kernel and measured an additional +30% kernel throughput at 1M clickbench URLs, +26% at 5M, +12% at 10M.
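For a sense of what the splits precomputation involves, a hedged sketch of a CPU pass that picks chunk boundaries at code granularity, assuming FSST's standard escape code 255 consumes two compressed bytes (the prototype's actual logic is not part of this PR):

```rust
// Hedged sketch: walk one string's compressed code stream and record
// offsets where a new GPU thread could start decoding, roughly every 32
// compressed bytes, without splitting an escape pair.
const ESCAPE: u8 = 255;
const TARGET_CHUNK: usize = 32;

fn split_points(codes: &[u8]) -> Vec<usize> {
    let mut splits = vec![0];
    let (mut i, mut last) = (0usize, 0usize);
    while i < codes.len() {
        // An escape code is followed by one literal byte; never split between them.
        i += if codes[i] == ESCAPE { 2 } else { 1 };
        if i - last >= TARGET_CHUNK {
            splits.push(i);
            last = i;
        }
    }
    splits
}
```

Each chunk additionally needs its decoded output offset, which is the extra CPU prep discussed in the review thread above and part of why splits read more like motivation for a GSST encoding.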
Four kernel variants are generated for the unsigned widths of `codes_offsets` (u8/u16/u32/u64); signed integer ptypes are reinterpreted as their unsigned equivalents on the Rust side, so the bit pattern is preserved without copying.
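The reinterpretation can be zero-copy along these lines (a sketch for the i32 → u32 case; the PR presumably handles this generically across ptypes):

```rust
// Hedged sketch of the zero-copy sign reinterpretation for one width.
// i32 and u32 share size, alignment, and valid bit patterns, so viewing
// the slice as unsigned is sound and copies nothing.
fn as_unsigned(offsets: &[i32]) -> &[u32] {
    unsafe { std::slice::from_raw_parts(offsets.as_ptr().cast::<u32>(), offsets.len()) }
}
```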
Addresses: #6538
Testing
Unit tests against the CPU implementation on small and larger datasets.